ABSTRACT


In the learning sciences, heterogeneity among students usu- 
ally leads to different learning strategies or patterns and 
may require different types of instructional interventions. 
Therefore, it is important to investigate student subtyp- 
ing, which is to group students into subtypes based on their 
learning patterns. Subtyping from complex student learn- 
ing processes is often challenging because of the informa- 
tion heterogeneity and temporal dynamics. Various inverse 
reinforcement learning (IRL) algorithms have been success- 
fully employed in many domains for inducing policies from 
trajectories, and have recently been applied to analyze
students’ temporal logs to identify their domain knowledge 
patterns. IRL was originally designed to model the data by 
assuming that all trajectories have a single pattern or strat- 
egy. Due to the heterogeneity among students, their strate- 
gies can vary greatly and the design of traditional IRL may 
lead to suboptimal performance. In this paper, we applied 
a novel expectation-maximization IRL (EM-IRL) to extract 
heterogeneous learning strategies from sequential data col- 
lected from three simulation environments and real-world 
longitudinal students’ logs. Experiments on simulation en- 
vironments showed that EM-IRL can successfully identify 
different policies from the heterogeneous sequences with dif- 
ferent strategies. Furthermore, experimental results from 
our educational dataset showed that EM-IRL can be used 
to obtain different student subtypes: a “learning-oriented” 
subtype who learned the material as much as possible re- 
gardless of the time in that they spent significantly more 
time than the other two subtypes and learned significantly; 
an “efficient-oriented” subtype who learned efficiently in that
they not only learned significantly but also spent less time
than the first subtype; and a “no learning” subtype who spent
less time than the first subtype and failed to learn.
1. INTRODUCTION


With the rapid development of educational technologies, 
longitudinal students’ learning progression trajectories are 
readily available. It is often challenging to analyze large- 
scale heterogeneous progression trajectories to infer high- 
level information embedded in student subgroups. This chal- 
lenge motivates the development of student modeling [1, 2, 
3, 4] and instructional intervention [5, 6, 7, 8]. 


Student subtyping, which seeks student groups with sim- 
ilar learning progression trajectories, is crucial to address 
the heterogeneity in the students, which ultimately leads to 
personalized instruction where students are provided with 
interventions tailored to their unique learning status. Stu- 
dent subtyping facilitates the investigation of different types 
of pedagogical strategies. From the data mining perspec- 
tive, student subtyping is posed as an unsupervised cluster- 
ing task of grouping students according to their historical 
records. Since these records are longitudinal and interre- 
lated, it is important to capture the dependencies among 
the elements of the recorded sequence to learn more effec- 
tive and robust representations, which can be utilized in the 
clustering stage to obtain the student subgroups. 


This work aims at investigating student subtyping based on 
their pedagogical strategies, which can be seen as a process 
of self-regulated learning [9, 10, 11, 12, 13] by setting one’s 
learning goals and ensuring the goals to be attained. Specif- 
ically, we focus on students’ pedagogical decision-making 
strategies during their interactions with an intelligent
tutoring system (ITS) that teaches probability. In this ITS, once a
problem is presented, the students will decide whether they 
want the ITS to tell them how to solve the next problem 
or complete the next step, by presenting a worked exam- 
ple, or they want the ITS to elicit the next problem or 
take the next step themselves, by requiring problem solv- 
ing. When making pedagogical decisions, the students have 
to self-regulate their own learning process which may change 
the learning outcomes even though the instructional content 
is controlled. We believe that students’ pedagogical strate- 
gies are closely related to metacognition, i.e., the processes 
involved in thinking about thinking [14]. 


Reinforcement learning (RL) offers one of the most promis- 
ing approaches to induce effective pedagogical strategies di- 
rectly from data. A number of researchers have studied ap- 
plying RL to improve the effectiveness of ITSs, e.g. [15, 7, 
16, 17, 18, 19, 20, 8, 21, 22], and much of the prior work
focused on inducing effective policies that determine the best
action for the ITS to take in any given situation so as to 
maximize a cumulative reward, which is often the student 
learning gain. On the other hand, in this work, our goal 
is to infer students’ pedagogical strategies based on their 
behaviors and decisions while interacting with the ITS. 


To do so, we applied inverse reinforcement learning (IRL). 
Unlike RL, where the reward function is explicitly given as 
input, IRL takes a set of trajectories as input and infers a
reward function from them. Given this inferred reward
function, RL can then be deployed to induce
the decision-making policy. Since the students’ decisions are 
generally made based on a trade-off among various complex 
factors, e.g., time, learning gain, difficulty of problems, etc., 
merely taking the learning gain as the reward cannot re- 
flect the actual decision-making patterns. As a result, we 
employed IRL to learn students’ strategies based on their 
behavioural data. Recently, IRL has been widely employed 
in various domains to understand how decisions are made in 
the given trajectories [23, 24]. Specifically, it has been em- 
ployed in educational domains to analyze students’ temporal 
log data to identify their domain knowledge patterns [25, 
26]. However, IRL was originally designed to model the 
data by assuming that all trajectories share a single pattern 
or strategy. Considering the heterogeneity among students, 
their pedagogical strategies can vary greatly and the de- 
sign of traditional IRL may lead to suboptimal performance. 
Although we could apply IRL to each student individually, doing
so would defeat our goal of revealing general and meaningful
patterns from students’ trajectories that account for
the heterogeneity among subgroups of students.


We employed a novel expectation-maximization IRL (EM-IRL)
algorithm [27] to model the heterogeneity among student
subtypes by assuming that different student subtypes
have different pedagogical strategies and students within
each subtype share the same strategy. EM-IRL
iteratively clusters students into subgroups and induces
a policy for each group by IRL until both the clusters and
the policies converge. The original EM-IRL work requires
the number of clusters to be pre-defined [27]. However,
when applying it to student subtyping in education, it
is often hard to figure out beforehand how many types of 
strategies are involved in students’ trajectories. Therefore, 
we embedded the original EM-IRL into a general framework 
which can automatically determine the optimal number of 
clusters from the data. 


In this work, we evaluated our general framework on three 
simulation environments: Grid World, Highway, and Moun- 
tain Car, and on real-world longitudinal students’ logs col- 
lected from an ITS. Our results in three simulation envi- 
ronments showed that EM-IRL could accurately cluster the 
data with different decision-making strategies. In addition, 
the experimental results showed that EM-IRL could be eas- 
ily employed to obtain the student subtypes. Specifically, we 
identified three student subtypes: a “learning-oriented” subtype
who tried to learn as much of the material as possible regardless
of the time spent and learned significantly from pre-
to post-test; an “efficient-oriented” subtype who learned
efficiently in that they not only learned significantly but also
spent significantly less time than the first subtype; and a “no
learning” subtype who spent less time than the first subtype and failed to learn.
The clustering results suggested the potential of targeting 
the students who are not using effective pedagogical strate- 
gies, adapting the interventions, and offering the students 
effective pedagogical skill training through the ITS. 


The remaining parts are organized as follows. In Section 2, 
related works are reviewed. Section 3 presents the methods, 
including the RL, IRL, and EM-IRL. Section 4 displays pre- 
liminary results we got in three simulation environments. 
Section 5 details data collected from the ITS. In Section 
6, we discuss the experimental setup for EM-IRL and some 
other clustering methods. Section 7 presents the experimen- 
tal results. Finally, Section 8 summarizes the paper. 


2. RELATED WORKS 
2.1 Students’ Subtyping 


Previous research has widely explored modeling of student 
subtyping to assist teachers in providing more targeted in- 
terventions at the right time. Generally, student subtyp- 
ing was analyzed via unsupervised clustering methods. For 
example, Lopez et al. employed an expectation maximisa- 
tion clustering method to determine if the students’ partic- 
ipation in course Moodle forum could be a good predictor 
of the final marks [28]. Durairaj and Vijitha applied K- 
means clustering to predict the pass/fail percentage of the 
students who appeared for a particular examination [29]. 
Khalil and Ebner clustered the students into appropriate 
categories based on their level of engagement [30], so that 
the teachers could increase retention and improve interventions
for specific sub-populations. All of these methods were
based on static data, without considering the dynamic
properties during learning.


With the rapid development of e-learning, an increasing 
amount of sequential data was collected via ITSs. In general, 
the clustering methods to handle sequential data could be 
generalized into three categories: proximity-based, feature-based,
and model-based [31]. More specifically, proximity-based
methods measure the similarity between pairs of sequences
via a distance calculated by the longest common
subsequence, dynamic time warping, etc. For example,
Shen and Chi proposed a temporal clustering framework
which measured the pair-wise distance between students
by dynamic time warping and then clustered them by
hierarchical clustering [32]. Their method identified some 
distinctive patterns among the clusters, which could pro- 
vide benefits to the personalized learning. Feature-based 
approaches would first compress the sequential data to be 
static, then the clustering methods taking static data as in- 
put could be further employed. For example, in [33] and [34], 
the authors aggregated the students’ activities to a feature 
vector and then applied K-means clustering to recognize 
learner groups in exploratory learning environments. In the 
model-based methods, the similarity of two sequences could be
calculated based on the likelihood of one of them given the 
model derived from the other. For example, Li and Yoo 
proposed to use a Markov chain based clustering methodol- 
ogy to model the students’ online learning behaviors col- 
lected during the learning process for more effective and 
adaptive teaching [31]. Additionally, Kock and Paramythis 
proposed a method combining K-means clustering with
discrete Markov models to identify new, semantically meaningful
problem-solving styles of the learners [35].


2.2 Students’ Pedagogical Strategies 

A number of researchers have investigated students’ peda- 
gogical decision-making [36, 37, 38, 39, 40, 41]. Previous 
research has shown that students make pedagogical deci- 
sions strategically. For example, Aleven et al. conducted 
a study to investigate students’ hint usage behavior [36]. 
Results showed that students used the easy-to-apply intelli- 
gent help more often than the Glossary. However, students 
often waited long before asking for a hint. When requesting 
hints, they often skipped the intermediate hints to reach the 
bottom-out hint which showed the solution directly. The 
results suggested that students preferred less effort-taking 
help (intelligent help and bottom-out hint), and oftentimes, 
they used the help less than they needed. 


Additionally, prior research showed that providing students 
with pedagogical decision-making assistance could result in 
better decision-making skills or learning performance. Roll 
et al. [37] examined the relationship between students’ help- 
seeking patterns and learning performance. They found 
that asking for help on challenging steps was generally pro- 
ductive while help abusing behaviors were correlated with 
poor learning. Mitrovic et al. [38] compared three types 
of decision-making modes: system control, student control, 
and faded control. Under the faded control, the system se- 
lected the problem for the student at the beginning of the 
training and gave explanations of why the problems should 
be selected. As the training proceeded, the control was given 
to the students. Results showed that the faded control group 
demonstrated improved problem selection skill and achieved 
better learning gain than the other two groups. Long et 
al. [39] compared an assistance condition, where problem selection
assistance was provided, with a standard condition (no
assistance). Their results showed that the assistance condition
achieved significantly better learning performance and better
declarative knowledge of a key problem-selection strategy
compared to the standard condition.


2.3 Learning From Demonstrations

Learning from demonstrations [42], also known as imitation 
learning [43] or apprenticeship learning [44], is a process to 
reproduce the decision-making behaviors in demonstrated 
trajectories. Generally, the methods in this area can be cat- 
egorized into two groups: 1) directly learning a policy as a 
state-action mapping by parroting the demonstrated behav- 
iors, which is typically done via supervised learning; and 2) 
inferring rewards from the demonstrations and then apply- 
ing reinforcement learning (RL) to induce the policy, which 
is called inverse reinforcement learning (IRL). The latter is 
generally preferred because the reward is a more robust, suc- 
cinct, and transferable definition for a task [45]. Specifically, 
compared to supervised learning, IRL generalizes better:
it can learn robustly from a smaller set of trajectories
collected from larger state spaces, and the succinctly represented
reward function can be readily transferred to other
agents in different scenarios.


Based on how the rewards are inferred, existing IRL algorithms
can be generalized into two categories: maximum
margin-based methods and probabilistic model-based methods.
Specifically, maximum margin-based methods infer rewards
by finding a model that maximizes the margin between
the demonstrated trajectories and other alternative behaviors [44].
However, they often suffer from the ill-posed issue
of non-uniqueness [45], i.e., there can be multiple reward
functions that explain the demonstrated behaviors. Probabilistic
model-based methods, on the other hand, are able
to handle this issue by using probability distributions to
introduce preferences over reward functions [46]. In this
category, Ramachandran and Amir [47] proposed a Bayesian
IRL, which combined prior knowledge and evidence from the 
demonstrated trajectories to derive a probability distribu- 
tion over the reward functions. Similarly, Ziebart et al. pro- 
posed a maximum entropy IRL which results in the least bi- 
ased estimation of the reward function [23]. Babes-Vroman 
et al. [27] proposed a maximum likelihood IRL (MLIRL), 
which finds the reward function that maximizes the probability
of observing the demonstrated behaviors. Their experimental
results showed that MLIRL outperformed some
other IRL methods, including the linear programming based 
maximum margin IRL and maximum entropy IRL. 


All the above methods assume a single reward function for 
all demonstrations. Some other approaches have been pro- 
posed to handle the multiple reward functions. Dimitrakakis 
and Rothkopf [48] proposed a Bayesian multi-task IRL, which 
learns a reward function for each individual trajectory us- 
ing the same prior distribution. Choi et al. [49] proposed a 
method based on nonparametric Bayesian IRL in which the 
prior of mixing distribution of different rewards was modeled 
by the Dirichlet process. Babes-Vroman et al. [27] proposed 
an EM-based framework, which iteratively computes the 
probabilities that the demonstrations belong to each clus- 
ter and updates the cluster-wise rewards based on MLIRL. 
Considering the efficiency and good performance of the EM-based
method, we adapted it for the analysis in this work.


Recently, IRL has been widely applied in various domains. 
Ziebart et al. [23] employed it in driver route modeling for 
predicting driving behaviors as well as for route recommendation.
Asoh et al. [24] applied IRL to medical records and
explored the potential rules in doctors’ diagnosis. Of most 
relevance, IRL has also shown effectiveness in the educational
domain. Rafferty et al. applied IRL
to automatically infer learners’ beliefs in an educational
game [25]. They demonstrated that IRL could recover the
participant’s beliefs towards how their actions could affect 
the environment, which indicated the potential to utilize IRL 
to interpret data from interactive educational environments. 
In another work by the same authors, IRL was further employed to
assess learners’ mastery of some skills in solving algebraic
equations. Based on the learned IRL results, skills that the
learners misunderstood could be detected, and personalized
feedback for improving those skills was then provided [26].


3. METHOD 


3.1 Reinforcement Learning 

Markov decision processes (MDPs) are widely used to model
user-system interactions. The central idea behind reinforcement
learning (RL) is to transform the problem of
inducing effective policies into a computational problem of
finding an optimal policy for choosing actions in an MDP. An
MDP describes a stochastic control process using a 5-tuple
$\langle S, A, T, R, \gamma \rangle$. Taking pedagogical policy induction
as an example, $S$ denotes the learning environment states,
which are often represented by student-system interaction features.
$A$ denotes the tutor’s possible actions, such as elicit or
tell. The reward function $R$ is generally defined by students’
learning performance. The transition probability $T$ can be
estimated from training data. $\gamma \in [0, 1)$ denotes a discount
factor for future rewards. Given a defined MDP, we can
transform our student-system interaction logs into trajectories as:

$$s_1 \xrightarrow{a_1, r_1} s_2 \xrightarrow{a_2, r_2} \cdots \, s_n \xrightarrow{a_n, r_n}$$

Here $s_i \xrightarrow{a_i, r_i} s_{i+1}$
indicates that at the $i$-th turn, the learning environment was
in state $s_i$; the tutor executed action $a_i$ and received reward
$r_i$; then the environment transitioned into state $s_{i+1}$.
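As an illustration of how the transition probability $T$ might be estimated from logged trajectories, the following is a minimal sketch that counts observed transitions and normalizes them. The tuple format `(s, a, r, s_next)`, the function name, and the uniform fallback for unvisited state-action pairs are assumptions made for this example, not details given in the paper.

```python
import numpy as np

def estimate_transitions(trajectories, n_states, n_actions):
    """Estimate T[s, a, s'] by counting observed transitions.

    Each trajectory is a list of (s, a, r, s_next) tuples with integer
    state and action indices (a hypothetical log format).
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for trajectory in trajectories:
        for s, a, _r, s_next in trajectory:
            counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    # Assumption: unvisited (s, a) pairs fall back to a uniform distribution.
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), uniform)

# Toy log with two states and two actions (0 = elicit, 1 = tell).
logs = [
    [(0, 0, 0.0, 1), (1, 1, 1.0, 1)],
    [(0, 0, 0.0, 0), (0, 1, 0.0, 1)],
]
T = estimate_transitions(logs, n_states=2, n_actions=2)
```

Each row `T[s, a]` is a probability distribution over next states, so it can be plugged into any tabular RL or IRL routine.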


In traditional RL, the reward function R serves as a guidance 
to praise or punish the agent’s behaviors to fulfil a certain 
task when interacting with the environment. Therefore, it 
is essential and needs to be elaborately hand-crafted in ad- 
vance to reflect the task. In ITS, the reward is generally for- 
mulated as the students’ learning performance, e.g., learn- 
ing gains, since the intention of tutor’s decision-making is to 
promote students’ learning. However, the reward function in
students’ decision-making is harder to determine:
students may have various learning patterns, e.g., finishing
the process as quickly as possible or working hard regardless
of the time, which are cumbersome to encode manually in
a reward function. Different reward functions reflect
the different strategies students employ during the training
process. Therefore, if students’ reward functions can be
learned in a data-driven manner, we can better understand
their pedagogical decision-making strategies.


3.2 Inverse Reinforcement Learning 
3.2.1 General IRL 


The difficulty of designing reward functions triggered the development
of inverse reinforcement learning (IRL). IRL
follows the reverse procedure compared to traditional RL:
in RL, given the reward function, the agent learns an optimal
policy; while in IRL, the trajectories derived from the
optimal policy are given, from which the agent learns the
reward function. It can be described as a stochastic control
process using a 4-tuple $\text{MDP}\backslash R = \langle S, A, T, \gamma \rangle$, in which the
reward function is missing.


In the general framework of IRL, the input is an $\text{MDP}\backslash R$ together
with some demonstrated trajectories $\mathcal{T}$. The reward
function $R_\theta$, parameterized by $\theta$, can be modeled as either a
linearly weighted sum of feature values or as belonging to a certain
distribution. Most existing IRL methods follow three
steps: in step 1, the parameter $\theta$ is randomly initialized; in
step 2, given $R_\theta$, general RL methods are applied to
induce the policy; in step 3, the divergence between the behaviors
under the learned policy and the given trajectories
is minimized to update $\theta$. Steps 2 and 3 are
repeated until the divergence is reduced to a desired level.


To investigate students’ pedagogical strategies, we can feed
their decision-making trajectories into the IRL model. Once
the reward function is learned, the strategy can be further
induced via traditional RL methods. Herein, we compared
some of the most commonly used IRL methods, including
quadratic programming based maximum margin IRL [44],
maximum entropy IRL [23], Bayesian IRL [47], and maximum
likelihood IRL (MLIRL) [27], over three online simulation
environments (i.e., Grid World, Highway, and Mountain
Car). We found that MLIRL always outperformed the others and
was also the most time-efficient. As a result, we use MLIRL for
the IRL-based analysis hereinafter.


General Process of IRL

Input: $\text{MDP}\backslash R = \langle S, A, T, \gamma \rangle$ and trajectories $\mathcal{T}$
Output: $R_\theta$

Step 1: Initialize the parameter $\theta$ of the reward function
Step 2: Solve the MDP to learn the policy $\pi$
Step 3: Update $\theta$ to minimize the divergence
        between $\mathcal{T}$ and the behaviors following $\pi$
Repeat step 2 and step 3 until convergence


3.2.2 Maximum Likelihood IRL

To formally define the maximum likelihood IRL, we denote
the $N$ input demonstrated trajectories as $\mathcal{T} = \{\xi_1, \ldots, \xi_N\}$,
where each trajectory is composed of a set of state-action pairs:
$\xi_i = \{(s_1, a_1), (s_2, a_2), \ldots\}$. The reward function is defined
as a linear function of the feature vector for state-action pairs:
$r_\theta(s,a) = \theta^\top \phi(s,a)$. Then the Q-value can be calculated as:

$$Q_\theta(s,a) = \theta^\top \phi(s,a) + \gamma \sum_{s'} T(s,a,s') \bigotimes_{a'} Q_\theta(s',a'), \tag{1}$$

$$\text{where} \quad \bigotimes_{a} Q_\theta(s,a) = \frac{\sum_{a} Q_\theta(s,a) \exp(\beta Q_\theta(s,a))}{\sum_{a'} \exp(\beta Q_\theta(s,a'))}. \tag{2}$$

Eq. 2 shows the Boltzmann exploration. Compared to the standard
Bellman equation, it makes the likelihood differentiable,
so the objective function is easier to optimize.
$\beta$ represents the degree of confidence and is set to 0.5 in
our experiments. The Boltzmann exploration policy parameterized
by $\theta$ is:

$$\pi_\theta(s,a) = \frac{\exp(\beta Q_\theta(s,a))}{\sum_{a'} \exp(\beta Q_\theta(s,a'))}. \tag{3}$$

Then the log-likelihood of trajectories $\mathcal{T}$ is calculated as:

$$L(\mathcal{T}|\theta) = \log \prod_{i=1}^{N} \prod_{(s,a) \in \xi_i} \pi_\theta(s,a)^{w_i} = \sum_{i=1}^{N} \sum_{(s,a) \in \xi_i} w_i \log \pi_\theta(s,a). \tag{4}$$

Herein, $w_i$ denotes the weight of $\xi_i$, which can be estimated
by its frequency of occurrence. By maximizing Eq. 4,
the parameter $\theta$ gives the trajectories $\mathcal{T}$ the highest
probability of being observed given the reward function $R_\theta$.
Once the reward function is learned, the strategy followed
by $\mathcal{T}$ can be further induced by any RL method, e.g., policy
iteration, which we employed in this work.
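To make Eqs. 1 to 4 concrete, the following is a minimal tabular sketch in NumPy. It assumes a small MDP with a feature array `phi` of shape (S, A, d) and a transition array `T` of shape (S, A, S); the function names, the fixed number of sweeps, the discount of 0.9, and the toy problem are illustrative choices, not the authors' implementation. Convergence of the iteration is not checked here; a fixed number of sweeps is simply run.

```python
import numpy as np

def soft_q_iteration(theta, phi, T, gamma=0.9, beta=0.5, n_iters=200):
    """Iterate Eq. 1 using the Boltzmann operator of Eq. 2."""
    reward = phi @ theta                      # r_theta(s, a), shape (S, A)
    Q = np.zeros_like(reward)
    for _ in range(n_iters):
        # Boltzmann-weighted average of Q over actions (Eq. 2), shape (S,).
        # Subtracting the row maximum leaves the ratio unchanged.
        w = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
        v = (Q * w).sum(axis=1) / w.sum(axis=1)
        Q = reward + gamma * (T @ v)          # Eq. 1
    return Q

def boltzmann_policy(Q, beta=0.5):
    """pi_theta(s, a) of Eq. 3, computed as a numerically stable softmax."""
    e = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(policy, trajectories, weights):
    """Weighted log-likelihood of Eq. 4; each trajectory holds (s, a) pairs."""
    return sum(
        w * sum(np.log(policy[s, a]) for s, a in xi)
        for xi, w in zip(trajectories, weights)
    )

# Toy problem (hypothetical): 2 states, 2 actions, one-hot features.
phi = np.eye(4).reshape(2, 2, 4)
T = np.full((2, 2, 2), 0.5)
theta = np.array([0.0, 1.0, 0.0, 1.0])        # action 1 is rewarded in both states
Q = soft_q_iteration(theta, phi, T)
policy = boltzmann_policy(Q)
ll = log_likelihood(policy, [[(0, 1), (1, 1)]], [1.0])
```

In this toy setting the policy prefers the rewarded action in both states, and MLIRL would adjust `theta` by gradient ascent on `ll`.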


In general, IRL methods assume the reward function to be 
unique for all input trajectories. However, it is often the 
case that the trajectories are heterogeneous and have various 
reward functions. For example, in ITS, students’ decision- 
making behaviors can have different patterns which cannot 
be easily captured by a single IRL model. As a result, a 
model suitable for multiple reward functions is favored. 
Algorithm: MLIRL

Input: $\text{MDP}\backslash R$, trajectories $\mathcal{T}$, trajectory weights $w_i$,
       $i = 1, \ldots, N$, learning rate $\alpha$
Initialize reward parameter $\theta$ randomly
Repeat
    Learn the policy $\pi_\theta$
    Compute $L = \sum_{i=1}^{N} \sum_{(s,a) \in \xi_i} w_i \log(\pi_\theta(s,a))$
    Update $\theta = \theta + \alpha \nabla_\theta L$
Until target number of iterations completed


3.3 Expectation-maximization IRL

To deal with trajectories with multiple reward functions,
i.e., multiple strategies, Babes-Vroman et al. [27] proposed a
straightforward expectation-maximization IRL (EM-IRL).
Herein, we adapted the original EM-IRL to automatically
determine the optimal number of clusters. Instead of directly
assigning the cluster number, we considered a maximal
possible number of clusters, $K_{max}$, and a variable
$k$, initialized as 2, indicating the current cluster number.


Specifically, to determine the optimal number of clusters,
starting from the cluster number $k = 2$, we iteratively ran
the EM procedure until a pre-defined stop_criteria
was met. The stop_criteria was defined as: either
some empty clusters were generated, or the log-likelihood (LL) of
the clustering results, defined in Eq. 5, changed by less than a
pre-defined threshold (set to 10) compared to the last iteration.
The LL reflects the clustering performance by
measuring the accordance of the learned clusters with the correspondingly
induced cluster-wise strategies. In Eq. 5, $N_j$
stands for the number of trajectories in cluster $j$.

$$LL = \sum_{j=1}^{k} \sum_{i=1}^{N_j} \log(z_{ij}) \tag{5}$$


Before the EM loop, the parameters $\rho_j$ and $\theta_j$, $j = 1, \ldots, k$, which
denote the estimated prior probability and reward parameter
of the $j$-th cluster, were randomly initialized.


In the E step, the probability that trajectory $i$ belongs to
cluster $j$ was calculated by Eq. 6, in which $Z$ is a normalization
factor. In the M step, the prior probability of each cluster is
updated by Eq. 7. Meanwhile, the reward parameter $\theta_j$ can
be learned by any IRL method; herein we employed MLIRL
with the weights of the trajectories being $z_{ij}$.

$$z_{ij} = \frac{1}{Z} \prod_{(s,a) \in \xi_i} \pi_{\theta_j}(s,a) \, \rho_j \tag{6}$$

$$\rho_j = \frac{\sum_{i} z_{ij}}{N} \tag{7}$$


The E step and M step are executed iteratively until a
target number of iterations is completed, which was set to 80
in this work to ensure convergence. Finally, we obtained $k$
clusters when the LL converged, with each cluster standing
for a group of trajectories with a unique reward function.


Based on these reward functions, we could further induce 
the cluster-wise strategies. 
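The E step and the prior update of the M step can be sketched as follows. This is an illustrative fragment only: it assumes the log-probability of every trajectory under every cluster policy has already been computed and stored in a hypothetical array `traj_loglik`, and it takes $Z$ to be a per-trajectory normalizer over clusters, which is the usual EM convention but is an assumption here. Re-estimating each $\theta_j$ with MLIRL is left out.

```python
import numpy as np

def e_step(traj_loglik, rho):
    """Responsibilities z[i, j], proportional to rho[j] * P(xi_i | theta_j).

    traj_loglik[i, j] is the log-probability of trajectory i under the
    policy of cluster j (assumed to be computed elsewhere).
    """
    log_z = traj_loglik + np.log(rho)
    log_z = log_z - log_z.max(axis=1, keepdims=True)   # guard against underflow
    z = np.exp(log_z)
    return z / z.sum(axis=1, keepdims=True)            # assumed per-trajectory Z

def m_step_prior(z):
    """Prior update: rho[j] = sum_i z[i, j] / N."""
    return z.sum(axis=0) / z.shape[0]

# Three trajectories, two clusters (made-up log-probabilities).
traj_loglik = np.array([[-1.0, -5.0],
                        [-6.0, -2.0],
                        [-1.5, -4.0]])
rho = np.array([0.5, 0.5])
z = e_step(traj_loglik, rho)
rho_new = m_step_prior(z)
```

Working in log space avoids underflow, since the product of policy probabilities along a long trajectory quickly becomes very small.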


Algorithm: EM-IRL

Input: $\text{MDP}\backslash R$, trajectories $\mathcal{T}$, maximal number of clusters $K_{max}$
Initialize $k = 2$
While $k < K_{max}$
    Initialize $\rho_j$ and $\theta_j$, $j = 1, \ldots, k$, randomly
    Repeat
        E Step: Compute $z_{ij}$, $i = 1, \ldots, N$
        M Step: Update the prior probability $\rho_j$; and
                learn the reward parameter $\theta_j$ via MLIRL
    Until target number of iterations completed
    If stop_criteria is True: Break; Else: $k = k + 1$


4. SIMULATION ENVIRONMENTS


Since the ground truth of students’ subtypes was unknown
in advance, it is difficult to directly evaluate the clusters
that EM-IRL learned from the students’ data. Thus, we first
ran EM-IRL in three simulation environments with
known ground truth. If different strategies could be accurately
distinguished by EM-IRL in simulations, we would
be more confident in deploying it in the ITS environment.


4.1 Environment Settings 
We explored three simulation environments including Grid 
World, Highway, and Mountain Car, as shown in Figure 1. 


Grid World: adapted from [27], in which three grid cells were
randomly chosen as puddles, indicated by bricks in Figure 1(a).


• States (25): a 5x5 grid.
• Actions (4): moving up, down, left, or right.
• Strategies (3): moving to the 1) upper-right corner; 2)
lower-left corner; or 3) lower-right corner.


The rewards are designed for the three strategies: 1) the upper-right
corner has a reward of 10; 2) the lower-left corner has
a reward of 10; 3) the lower-right corner has a reward of
10. Every other state is penalized with -1.


Highway: adapted from a three-lane highway scenario introduced
in [50], in which the agent controlled a blue car
with three speed levels that could switch between the
three lanes or go off-road on either side. At all timestamps,
there was a red car in one of the three lanes.


• States (729): the blue car’s speed had 3 levels and it could
move horizontally among 9 locations; the red car could move
vertically among 9 locations and horizontally among 3 locations.
• Actions (5): staying in the current state, speeding up,
slowing down, moving left, or moving right.
• Strategies (2): 1) keeping off the left lane (suppose it is
under construction); 2) driving at the fastest speed.


The rewards are designed for the two strategies: 1) driving
in the left lane has a reward of -10; 2) driving at the
lowest speed level has a reward of -10. In both strategies,
going off-road is penalized with -0.5, a collision is penalized with -5, and
maintaining the state has no reward.


Mountain Car: adapted from MountainCar-v0 in OpenAI
Gym [51], in which a car is on a one-dimensional track
and moves between two mountains.


• States (80): 10 horizontal positions with 8 levels of speed.
• Actions (3): pushing left, no pushing, or pushing right.
• Strategies (2): 1) reaching the right mountaintop
(the car needs to drive back and forth to build up enough
momentum to push up); 2) parking at the valley bottom.


The rewards are generated for the two strategies: 1) the right
mountaintop has a reward of +10; 2) the valley bottom has
a reward of +10. Every other state is penalized with -1.


Figure 1: Three simulation environments: (a) Grid World; (b) Highway; (c) Mountain Car.


Table 1: Cluster-wise and overall purities by EM-IRL clustering in three simulation environments.

Environment     Strategy Idx    Cluster-wise Purity (%)    Overall Purity (%)
Grid World      1               100                        100
                2               100
                3               100
Highway         1               100                        100
                2               100
Mountain Car    1               100                        96.4
                2               93.2


In each environment, the initial states were randomly assigned,
and the transitions between states were stochastic and
estimated from the data. For each strategy, we induced a 
policy via policy iteration and employed it to collect trajec- 
tories. Specifically, the number of collected trajectories for 
each strategy was 500, 1000, and 1000 in three environments, 
respectively. In each environment, trajectories with various 
strategies were mixed together and fed into the EM-IRL. 
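As a rough illustration of how demonstrations for one Grid World strategy could be generated, the sketch below runs policy iteration on a 5x5 grid and rolls out the resulting policy. It is simplified relative to the setup described above: transitions are deterministic, puddles are omitted, and the discount factor of 0.9, the action encoding, and the function names are all assumptions for this example.

```python
import numpy as np

N = 5                                            # grid side length
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right

def step(s, a):
    """Deterministic move; bumping into a wall leaves the state unchanged."""
    r, c = divmod(s, N)
    dr, dc = MOVES[a]
    r2 = min(max(r + dr, 0), N - 1)
    c2 = min(max(c + dc, 0), N - 1)
    return r2 * N + c2

def policy_iteration(goal, gamma=0.9):
    S = N * N
    reward = np.full(S, -1.0)                    # every state costs -1 ...
    reward[goal] = 10.0                          # ... except the goal corner
    nxt = np.array([[step(s, a) for a in range(4)] for s in range(S)])
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R exactly.
        P = np.zeros((S, S))
        P[np.arange(S), nxt[np.arange(S), policy]] = 1.0
        V = np.linalg.solve(np.eye(S) - gamma * P, reward)
        # Policy improvement: switch only where an action is strictly better.
        q = reward[:, None] + gamma * V[nxt]
        better = q.max(axis=1) > q[np.arange(S), policy] + 1e-9
        if not better.any():
            return policy
        policy = np.where(better, q.argmax(axis=1), policy)

def rollout(policy, start, horizon):
    """Collect (state, action) pairs by following the policy."""
    s, xi = start, []
    for _ in range(horizon):
        a = int(policy[s])
        xi.append((s, a))
        s = step(s, a)
    return xi, s

# Strategy 1: head for the upper-right corner (state index 4).
policy = policy_iteration(goal=4)
xi, end_state = rollout(policy, start=20, horizon=8)   # start in the lower-left corner
```

Repeating the rollout from random start states for each goal corner and pooling the results would give a mixed set of trajectories of the kind fed into EM-IRL.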


Given the ground truth of cluster membership in the simulation
environments, the results of EM-IRL were evaluated by the
purity of each cluster and across all clusters. Denote the
size of the $i$-th cluster as $N_i$ with ground-truth labels $L_i$; then the
cluster-wise purity is calculated as the number of majority
labels divided by the cluster size, i.e., $purity_i = \frac{majority(L_i)}{N_i}$,
and the overall purity is calculated as the mean of the purities
of all $k$ clusters, i.e., $purity = \frac{1}{k} \sum_{i=1}^{k} purity_i$.
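The purity measures described above can be computed with a few lines of Python. The sketch below follows the textual definition (overall purity as the unweighted mean of cluster-wise purities); the function name and the toy labels are made up for illustration.

```python
from collections import Counter

def cluster_purities(cluster_ids, true_labels):
    """Return ({cluster: purity}, overall purity as the mean over clusters)."""
    members = {}
    for c, y in zip(cluster_ids, true_labels):
        members.setdefault(c, []).append(y)
    per_cluster = {
        # Count of the majority label divided by the cluster size.
        c: Counter(ys).most_common(1)[0][1] / len(ys)
        for c, ys in members.items()
    }
    overall = sum(per_cluster.values()) / len(per_cluster)
    return per_cluster, overall

# Toy example: one "top" trajectory ends up in the "valley" cluster.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1]
labels = ["top", "top", "top", "top",
          "valley", "valley", "valley", "valley", "top"]
per_cluster, overall = cluster_purities(clusters, labels)
# Cluster 0 is pure (1.0), cluster 1 has purity 0.8, and the mean is about 0.9.
```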


4.2 EM-IRL Results in Three Simulations 


The EM-IRL clustering results for the three simulation environments are shown in Table 1. The first column gives the environment; the second and third columns give the strategy index and the corresponding cluster-wise purity; the last column shows the overall purity across all clusters.


In Grid World, all strategies were accurately clustered by EM-IRL: both the cluster-wise purities and the overall purity were 100%. Likewise, in Highway, the two strategies were clustered with a purity of 100%. In Mountain Car, a few trajectories of driving to the right mountaintop (Strategy 1) were mis-clustered into parking at the valley bottom (Strategy 2). This is because the mis-clustered trajectories first moved left to build up momentum, which closely resembled the behavior of heading for the valley. Overall, the results suggest that EM-IRL can accurately distinguish subtypes of trajectories that follow different strategies in all three simulation environments.


5. ITS LEARNING ENVIRONMENT 


Our data were collected from students working on a web-based ITS that teaches college students probability, e.g., the Addition Theorem and Bayes’ Theorem. Instruction was delivered by guiding students through training problems. For each problem, the tutor provided step-by-step instruction, immediate feedback, and on-demand help. The help was provided via a sequence of increasingly specific hints. The last hint in the sequence, i.e., the bottom-out hint, told the student exactly what to do. During training, the students could make pedagogical decisions on whether to solve the next step by themselves or to watch the tutor solve it. If they chose to solve it themselves, the tutor elicited the solution from them by asking questions; otherwise, the tutor showed or told them the solution directly.


5.1 Data Collection 


All students participating in our data collection went through four phases: textbook, pre-test, training, and post-test. During the textbook phase, all students studied the domain principles from a probability textbook. They read a general description of each principle, reviewed some examples of it, and solved some single- and multiple-principle problems. The students then took a pre-test, which contained 14 problems. During this phase, they were not given feedback on their answers, nor were they allowed to go back to earlier questions (this was also true for the post-test). During the ITS training phase, students received 12 problems in the same order.
Each main domain principle was applied at least twice. The 
minimal number of steps needed to solve each training prob- 
lem ranged from 20 to 50. Such steps included variable def- 
initions, principle applications, and equation solving. The 


Proceedings of The 13th International Conference on Educational Data Mining (EDM 2020) 274 


number of domain principles required to solve each problem ranged from 3 to 11. Finally, all students took the post-test, which contained 20 problems in total: 14 were isomorphic to the problems given in the pre-test phase, while the remaining 6 were harder, non-isomorphic multiple-principle problems.


The pre- and post-tests required students to derive an answer by writing and solving one or more equations. We used three scoring rubrics: binary, partial credit, and one-point-per-principle. Under the binary rubric, a solution was worth 1 point if it was completely correct and 0 otherwise. Under the partial-credit rubric, each problem score was defined as the proportion of correct principle applications evident in the solution; for example, a student who correctly applied 4 of 5 possible principles would get a score of 0.8. The one-point-per-principle rubric gave one point for each correct principle application. All of the tests were graded in a double-blind manner by a single experienced grader. The results presented here are based on the partial-credit rubric, but the same results hold for the other two. For comparison purposes, all test scores were normalized to the range [0, 100].
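

The three rubrics can be contrasted on a single solution with a short sketch. The function name and the input format (one boolean per required principle) are assumptions for illustration; in particular, the sketch treats a solution as "completely correct" when every principle application is correct, which simplifies the actual grading.

```python
def score_solution(principles_correct):
    """Score one solution under the three rubrics.

    principles_correct: list of booleans, one per principle the problem requires.
    Assumption: "completely correct" means every principle application is correct.
    """
    n_correct = sum(principles_correct)
    n_total = len(principles_correct)
    return {
        "binary": 1 if n_correct == n_total else 0,
        "partial_credit": n_correct / n_total,
        "one_point_per_principle": n_correct,
    }


# A student who correctly applied 4 of 5 possible principles.
scores = score_solution([True, True, True, True, False])
print(scores)
```

The same solution thus earns 0 under the binary rubric, 0.8 under partial credit, and 4 points under one-point-per-principle.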


We measured students’ learning performance using the normalized learning gain (NLG), which measures their gain irrespective of their incoming competence. It is calculated as $NLG = \frac{post - pre}{100 - pre}$, where $pre$ and $post$ refer to the students’ test scores before and after the ITS training, respectively, and 100 is the maximum score. For the post-test, we considered all 20 problems, both isomorphic and non-isomorphic. In addition, an isomorphic NLG (Iso_NLG) was measured. Unlike NLG, Iso_NLG was calculated from the pre-test and isomorphic post-test scores, the latter containing only the 14 isomorphic multiple-principle problems.
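

A minimal sketch of the NLG computation follows, assuming the gain is the score change divided by the room left for improvement, with 100 as the maximum score. The function name and the example scores are hypothetical.

```python
def normalized_learning_gain(pre, post, max_score=100.0):
    """Gain relative to the room for improvement left after the pre-test."""
    if pre >= max_score:
        raise ValueError("NLG is undefined when the pre-test score is at the maximum")
    return (post - pre) / (max_score - pre)


# Made-up scores: one student who improves and one who regresses.
print(normalized_learning_gain(60, 80))
print(normalized_learning_gain(80, 70))
print(normalized_learning_gain(95, 85))
```

Since the denominator shrinks as the pre-test score approaches the maximum, a small drop for a high-scoring student produces a large negative value, as the last call shows.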


5.2 States & Actions 


Our dataset contains 127 students. Each student spent ~ 2 
hours on the system and completed around 400 steps. 


States: A total of 142 state features were extracted from the student-system interaction log data. The features can be grouped into five categories:


• Autonomy (10 features): the amount of work done by a student, such as the number of elicits since the last tell;


• Temporal (29): time-related information about the student’s behavior, such as the average time per step;


• Problem Solving (35): information about the current problem-solving context, such as problem difficulty;


• Performance (57): information about the student’s performance so far, such as the percentage of correct entries;


• Hints (11): information about the student’s hint usage, such as the total number of hints requested.


For each category, we employed K-means clustering to discretize the features into states. The number of clusters was chosen at the elbow of the clustering error curve after the clustering had converged, which gave the following number of states per category: Autonomy (3 states), Temporal (4), Problem Solving (3), Performance (4), and Hints (3). Combining the categories yielded 3 × 4 × 3 × 4 × 3 = 432 discrete states in total. Based on the discretized states, we estimated the transition probabilities from all available data.
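

One way to picture how five per-category cluster ids become a single discrete state, and how transition probabilities can be estimated from logs, is the sketch below. The mixed-radix encoding and the frequency-count estimator are assumptions made for illustration, as are all names and the toy trajectories; the text above only fixes the per-category state counts and the total of 432.

```python
from collections import defaultdict

# Number of K-means clusters per feature category, as listed above:
# Autonomy, Temporal, Problem Solving, Performance, Hints.
CATEGORY_SIZES = [3, 4, 3, 4, 3]


def joint_state(cluster_ids, sizes=CATEGORY_SIZES):
    """Mixed-radix encoding of per-category cluster ids into one state index."""
    index = 0
    for cid, size in zip(cluster_ids, sizes):
        if not 0 <= cid < size:
            raise ValueError("cluster id out of range")
        index = index * size + cid
    return index


def estimate_transitions(trajectories):
    """Frequency estimate of P(s' | s, a) from observed (s, a, s') triples."""
    counts = defaultdict(lambda: defaultdict(int))
    for trajectory in trajectories:
        for s, a, s_next in trajectory:
            counts[(s, a)][s_next] += 1
    return {
        sa: {s2: n / sum(nexts.values()) for s2, n in nexts.items()}
        for sa, nexts in counts.items()
    }


# The largest cluster id in every category maps to the last of the 432 states.
print(joint_state([2, 3, 2, 3, 2]))

# Toy trajectories with made-up states and actions.
toy = [[(0, "elicit", 1), (1, "tell", 1)], [(0, "elicit", 2)]]
T = estimate_transitions(toy)
print(T[(0, "elicit")])
```

State-action pairs that never occur in the logs get no entry in this estimate, so a practical implementation would need some smoothing or a default for unseen pairs.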


Actions: Students could take one of two actions, elicit or tell, i.e., to work out the solution themselves in response to the tutor's questions, or to let the tutor tell them the solution directly.


6. EXPERIMENTAL SETTINGS 
6.1 Student Subtyping by EM-IRL 


Based on the clusters learned by EM-IRL, we tested for statistically significant differences among the clusters in the pre-test score, isomorphic NLG (Iso_NLG), NLG, learning time on the training task (Time), and the percentage of elicit among students’ decisions (Elicit_Perc).


6.2 Student Subtyping by Other Methods 
6.2.1 Clustering by Traditional Methods 


To evaluate the clustering performance of EM-IRL, we compared it with three other clustering methods: two K-means based approaches, which took as input the pre-test scores and the learning state in the final step, respectively, and a K-medoids based approach, which took as input the dynamic time warping (DTW) [52] distance between trajectories. The K-means based approaches cluster on static information, while the K-medoids based approach on DTW considers the dynamic state transitions in the trajectories. In our experiments, each of these methods generated three clusters, and MLIRL was employed to learn a strategy for each cluster. Based on the learned strategies, we calculated the log-likelihood (LL; see Eq. 5) of observing the clustering results.
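

For readers unfamiliar with DTW, the following sketch shows the textbook dynamic-programming recurrence between two trajectories of feature vectors. The exact DTW variant used with K-medoids is not specified here, so the Euclidean local cost and the absence of any warping-window constraint are assumptions, and the function name is hypothetical.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of the first i and first j elements.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances alone
                cost[i][j - 1],      # seq_b advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Two toy trajectories of different length that trace the same path.
a = [(0.0,), (1.0,), (2.0,)]
b = [(0.0,), (1.0,), (1.0,), (2.0,)]
print(dtw_distance(a, b))
```

Because repeated states can be absorbed by the warping, trajectories of different lengths remain comparable, which is what makes the resulting pairwise distances usable as input to K-medoids.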


6.2.2 Clustering by Matching RL/IRL Policies

We further explored whether RL or IRL policies could model the heterogeneity in students' decision-making. The induction of these two policies is detailed as follows.


Inducing the RL policy: To investigate whether students’ learning strategies could be distinguished from the tutor’s perspective, we compared students’ decisions with an RL-induced pedagogical policy and clustered the students based on the matching rate. Since the RL policy was induced with the goal of improving students’ learning performance, we expected the group with a higher matching rate with the RL policy to have better learning performance.


Specifically, we applied RL to learn a pedagogical policy that determines whether the next step should be elicit or tell (the same decisions students made in our ITS). The training data set contained 1,118 students’ interaction logs collected from a series of seven prior studies, which followed the same procedure and used the same learning materials as the study described in Section 5. The same 142 features used by EM-IRL were extracted from the logs and used to induce the policy. In an empirical classroom study, the policy was compared with a deep Q-network (DQN) induced policy and a random policy. Results showed that the RL policy significantly outperformed both of them [21].


Once the RL policy was induced, we applied it to the student decision-making data (127 students) to see what decision the RL policy would make at each step. We then calculated the matching rate between each student's decisions and the RL policy. Based on the matching rates, the students were split into three groups via K-means clustering, denoted as High, Medium, and Low according to the average matching rate of the group.
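

The matching-rate procedure can be sketched as follows. The policy representation (a state-to-action dictionary), the quantile-based initialization of the one-dimensional K-means, and all names and numbers are assumptions for illustration only.

```python
def matching_rate(steps, policy):
    """Share of a student's decisions that agree with the policy's action."""
    if not steps:
        raise ValueError("student has no recorded decisions")
    agree = sum(1 for state, action in steps if policy[state] == action)
    return agree / len(steps)


def kmeans_1d(values, k=3, iters=100):
    """Plain Lloyd's algorithm on scalars; centers start at evenly spaced quantiles."""
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        new_centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def label_groups(values, centers):
    """Name the three clusters Low/Medium/High by ascending center."""
    names = ["Low", "Medium", "High"]
    order = sorted(range(len(centers)), key=lambda c: centers[c])
    rank = {c: names[r] for r, c in enumerate(order)}
    return [rank[min(range(len(centers)), key=lambda c: abs(v - centers[c]))] for v in values]


# Hypothetical policy and one student's (state, action) decisions.
policy = {0: "elicit", 1: "tell"}
steps = [(0, "elicit"), (1, "elicit"), (1, "tell"), (0, "elicit")]
print(matching_rate(steps, policy))

# Made-up matching rates for six students.
rates = [0.50, 0.55, 0.70, 0.72, 0.85, 0.88]
labels = label_groups(rates, kmeans_1d(rates))
print(labels)
```

The group names are attached only after clustering, by ordering the cluster centers, so the labels describe the outcome rather than steer it.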


Inducing the IRL policy: Similarly, to investigate whether students’ learning strategies could be distinguished from their own perspective, we applied IRL to induce a policy from the student decision-making data and compared students’ decisions with the IRL policy. Given that our data analysis showed that most students learned significantly from the ITS training, we assumed that a majority of students completed the training with the goal of learning. Thus, we expected the group with a higher matching rate with the IRL policy to have better learning performance.


The IRL policy was induced from the 127 students who were given the opportunity to make pedagogical decisions during training. The MLIRL algorithm [27] was used for policy induction. As in the RL-based method, the IRL policy was applied back to the students’ data to calculate the matching rate between each student's decisions and the IRL policy. K-means clustering was then applied to the matching rates to cluster students into High, Medium, and Low groups.


7. RESULTS 


7.1 Student Subtyping by EM-IRL 

We fitted the students’ data to the EM-IRL framework described in Section 3.3; when stop_criteria was met, we obtained three clusters. Table 2 shows the EM-IRL subtyping results. From left to right, it shows the student subtype, number of students (# Stu), pre-test score (Pre), isomorphic NLG (Iso_NLG), NLG, time on the training task (Time), and percentage of elicit among students’ decisions (Elicit_Perc). Based on statistical analysis, we named the three resulting clusters learning-oriented, efficient-oriented, and no learning.


A one-way ANOVA on pre-test scores showed no significant difference among the three clusters: F(2, 124) = 1.36, p = 0.260, η = 0.022. This suggests that students in the three clusters were balanced in incoming competence. To measure students’ learning gain in training, we conducted analyses on their Iso_NLG and NLG. A one-way ANOVA on Iso_NLG showed a significant difference among the three clusters: F(2, 124) = 3.24, p = 0.042, η = 0.050. Subsequent contrast analysis revealed that learning-oriented > no learning: t(124) = 2.54, p = 0.012, d = 0.75, and efficient-oriented > no learning: t(124) = 2.19, p = 0.030, d = 0.54. Similar results were found for NLG: a one-way ANOVA showed a significant difference among the three clusters: F(2, 124) = 3.73, p = 0.027, η = 0.057. Subsequent contrast analysis revealed that learning-oriented and efficient-oriented significantly outperformed no learning: t(124) = 2.73, p = 0.007, d = 0.77 and t(124) = 2.15, p = 0.033, d = 0.52, respectively.
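

For reference, the F statistic of a one-way ANOVA and an accompanying effect size can be computed as in the sketch below. The function name and the toy samples are hypothetical, and treating the effect size as eta squared (between-group sum of squares over total sum of squares) is an assumption, although the effect sizes reported alongside each F statistic appear numerically consistent with it. With 127 students in 3 clusters, the degrees of freedom are 3 - 1 = 2 and 127 - 3 = 124, matching the F(2, 124) reported above.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of samples."""
    all_values = [x for g in groups for x in g]
    n_total, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# Toy samples (not the paper's data).
f_stat, df1, df2, eta_sq = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df1, df2, eta_sq)
```

The p-value would additionally require the F distribution's survival function, which is omitted here to keep the sketch free of third-party dependencies.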


In terms of time on task, a one-way ANOVA showed a significant difference among the three clusters: F(2, 124) = 5.81, p = 0.004, η = 0.086. Subsequent contrast analysis indicated that learning-oriented spent more time on task than the other two clusters: t(124) = -3.11, p = 0.002, d = 0.58 for efficient-oriented and t(124) = 2.37, p = 0.019, d = 0.63 for no learning. A contrast analysis on the percentage of elicit among students’ decisions revealed that learning-oriented took significantly more elicit actions than no learning: t(124) = 2.24, p = 0.027, d = 0.70.


To summarize, the learning-oriented subtype spent significantly more time on the training task than the other two groups and achieved the best performance on both Iso_NLG and NLG (significantly higher than no learning). This suggests that learning-oriented students focused mainly on learning the material, regardless of the time it took. The efficient-oriented subtype significantly outperformed no learning on learning performance while spending significantly less time than learning-oriented. This suggests that efficient-oriented students could balance learning gain and time on task. Finally, the no learning subtype achieved the lowest learning outcomes.


7.2 Student Subtyping by Other Methods 


7.2.1 Clustering by Traditional Methods 

We compared our EM-IRL with three traditional baseline 
clustering methods, namely K-means on the pre-test score 
(K-means on Pre); K-means on the learning state (142 fea- 
tures) in the final step (K-means on Final Step); K-medoids 
on the DTW distance among trajectories [52], which is calcu- 
lated based on the 142 features (K-medoids on DTW). The 
results are shown in Table 3, with the two columns being 
clustering method and the resulting log-likelihood (LL). 


Overall, the results showed that the dynamic-information-based clustering approaches (K-medoids on DTW and EM-IRL) performed better than the static-information-based approaches (K-means on Pre and K-means on Final Step). Between the two static-information-based approaches, K-means on Final Step performed better than K-means on Pre. This is not surprising because the state in the final step included information generated during training, while the pre-test score only included information up to the end of the pre-test. Between the two dynamic-information-based approaches, EM-IRL outperformed K-medoids on DTW. A possible reason is that EM-IRL took both states and actions into account, while K-medoids on DTW considered only the states in the trajectories.


7.2.2 Clustering by Matching RL/IRL Policies

Results of Matching with the RL Policy: Based on the matching rate with the RL policy, we got three clusters by K-means: High (M = .84, SD = .05), Medium (M = .70, SD = .05), and Low (M = .52, SD = .07). A one-way ANOVA on the matching rate showed a significant difference: F(2, 124) = 339.87, p < 0.0001, η = 0.846. Subsequent contrast analysis showed that High > Medium: t(124) = 4.38, p < 0.0001, d = 0.99, and Medium > Low: t(124) = 8.01, p < 0.0001, d = 1.70.


A one-way ANOVA on the pre-test scores showed no significant difference among the three groups: F(2, 124) = 0.26, p = 0.771, η = 0.004. Analyses on Iso_NLG (calculated from the pre-test and the isomorphic post-test) and NLG (calculated from the pre-test and the full post-test, which contains six additional hard problems) also showed no significant difference among the three groups. In terms of time on the training task, there was a significant difference among the three groups: High (M = 2.40, SD = .50), Medium (M = 2.42, SD = .66), and Low (M = 1.88, SD = .40).


Table 2: EM-IRL clustering results in the ITS environment.

Subtype            | # Stu | Pre        | Iso_NLG      | NLG          | Time      | Elicit_Perc (%)
learning-oriented  | 50    | 73.9(16.8) | 55.9(45.3)   | 23.4(53.6)   | 2.52(.70) | 87.53(13.40)
efficient-oriented | 64    | 76.2(14.5) | 43.9(92.4)   | -4.4(127.2)  | 2.18(.45) | 84.93(15.02)
no learning        | 13    | 81.9(17.4) | -21.1(212.1) | -98.4(340.4) | 2.10(.50) | 77.06(20.04)


Table 3: Comparison of the log-likelihood (LL) for different clustering methods.

Method                | LL (×10^3)
K-means on Pre        | -10.68
K-means on Final Step | -9.60
K-medoids on DTW      | -8.83
EM-IRL                | -6.36


A one-way ANOVA on time showed: F(2, 124) = 9.21, p = 0.0002, η = 0.129. Subsequent contrast analysis revealed that the High and Medium groups spent significantly more time than the Low group: t(124) = 3.85, p = 0.0002, d = 1.11 and t(124) = 3.99, p = 0.0001, d = 0.92, respectively. An analysis of the percentage of elicit among students’ decisions showed a significant difference among the three groups: F(2, 124) = 66.97, p < 0.0001, η = 0.519. Subsequent contrast analysis revealed that High > Medium: t(124) = 4.38, p < 0.0001, d = 0.99, and Medium > Low: t(124) = 8.01, p < 0.0001, d = 1.70.


The results showed that, by matching with the RL policy, we could differentiate students’ time-consuming strategies from time-efficient ones. However, this approach could not identify student subtypes that differed in learning performance. This suggests a gap between the tutor’s and the students’ strategies. Specifically, rather than passively following the tutor’s decisions, students might prefer to actively direct their own learning process. Therefore, deploying the tutor’s strategy to students might not promote learning performance as expected.


Results of Matching with the IRL Policy: Based on the matching rate with the IRL policy, we got three clusters by K-means: High (M = .86, SD = .05), Medium (M = .71, SD = .05), and Low (M = .54, SD = .06). A one-way ANOVA on the matching rate showed a significant difference among the three groups: F(2, 124) = 360.99, p < 0.0001, η = 0.853. Subsequent contrast analysis showed that High > Medium: t(124) = 15.92, p < 0.0001, d = 3.37, and Medium > Low: t(124) = 13.52, p < 0.0001, d = 3.23.


A one-way ANOVA on the pre-test scores showed no significant difference among the three groups: F(2, 124) = 1.17, p = 0.314, η = 0.019. Analyses on Iso_NLG and NLG also showed no significant difference among the three groups. In terms of time on the training task, there was a significant difference among the three groups: High (M = 2.44, SD = .54), Medium (M = 2.27, SD = .68), and Low (M = 2.08, SD = .42). A one-way ANOVA on time showed: F(2, 124) = 3.11, p = 0.048, η = 0.048. Subsequent contrast analysis showed that the High group spent significantly more time than the Low group: t(124) = 2.43, p = 0.017, d = 0.70. An analysis of the percentage of elicit among students’ decisions showed a significant difference among the three groups: F(2, 124) = 93.92, p < 0.0001, η = 0.602. Subsequent contrast analysis revealed that High > Medium: t(124) = 7.95, p < 0.0001, d = 1.83, and Medium > Low: t(124) = 7.08, p < 0.0001, d = 1.43.


The results showed that IRL-based policy matching was able to separate students whose strategies differed in time on task. However, it was unable to identify a specific subtype of students whose strategy led to better learning outcomes. One possible reason is that a single policy was insufficient to capture the decision-making patterns of all students: different students might follow heterogeneous decision-making strategies.


In summary, the results suggest that EM-IRL can effectively subtype students in a way that reflects their different decision-making strategies. In contrast, clustering by traditional methods or by matching RL/IRL policies could not find the desired student subtypes.


8. CONCLUSIONS 


In this paper, we investigated student subtyping via EM-IRL. By analyzing student subtypes, we aimed to put ourselves in the students' shoes to better understand their decision-making. To evaluate the performance of EM-IRL, we first applied it to three simulation environments, where it accurately clustered trajectories that followed different strategies. Given the accurate clustering results in the simulators, we were more confident in applying EM-IRL to real-world longitudinal student logs collected from an ITS. The results suggested that EM-IRL could effectively group students into different subtypes, e.g., learning-oriented, efficient-oriented, and no learning. In contrast, clustering by traditional methods or by matching RL/IRL policies could not find the desired subtypes. The subtyping results show the potential to give tutors evidence for more customized interventions that better assist students’ learning. In the future, we will conduct early clustering to detect students’ strategies as early as possible. In addition, empirical studies will be carried out to evaluate the effectiveness of subtyping-based interventions in improving the targeted group of students.